Are we alone in the Universe? This is one of the most profound questions that humankind has sought to answer since the beginning of recorded history. We can gain some insight into this mystery with the modern search for exoplanets. The underlying purpose of contemporary exoplanet programs is to discover habitable planets, especially around nearby stars, and find evidence of life elsewhere in the cosmos. Using Earth and its lifeforms as a template to assess habitability, we seek other celestial bodies with conditions similar to our own, i.e., with Earth-like surface gravities and temperatures where liquid water could exist. In order to meet these criteria, most habitable worlds must be a particular mass and a particular distance from their host star. We are interested in analyzing the data for confirmed exoplanets to asses our progress in finding a planet with similar physical characteristics to Earth.
For our investigation, we used data from the NASA Exoplanet Archive, which can be found at the following URL https://exoplanetarchive.ipac.caltech.edu. This archive is an online compilation, collation, and cross-correlation of astronomical data and information on exoplanets and their host stars. The data are vetted by a team of astronomers at the California Institute of Technology (Caltech) under contract with the National Aeronautics and Space Administration (NASA) under the Exoplanet Exploration Program. An extensive overview of the data, services and tools of the archive can be found in a published paper by Akeson, et al. (2013, PASP, 125, 989) in the Publications of the Astronomical Society of the Pacific (PASP). This publication can be found here.
We downloaded our dataset from the NASA Exoplanet Archive on 2018 July 5. The data consists of 354 columns and 3,748 rows of information on confirmed exoplanets and their host stars as well as information about their discovery. Discovery information includes the method used to detect the exoplanet, the locale of the observatories used for detection (i.e., whether ground-based, space-based, or a mixture of both observations were used for detection), and the year of discovery. Physical characteristics of exoplanets present in the archive include planetary mass, orbital period, and orbital semi-major axis. Physical properties of host stars listed in the archive include stellar mass, stellar radius, effective temperature, surface gravity, spectral type, luminosity, and distance from Earth. These physical properties for both exoplanets and their host stars are important for determining the similarity between an exoplanet and Earth, and their detection sensitivity.
We trimmed and removed space characters from the original dataset from the NASA Exoplanet Archive to the 15 columns that are most relevant and directly related to habitable planets. These data are stored in the file planets.csv. The variables for which we are interested as responses are pl_bmasse, pl_orbsmax and st_dist are the planet mass in Earth masses (\(R_\odot = 5.9722 \times 10^{24}\,\rm{kg}\)), the planet orbital semi-major axis in astronomical units (\(\rm{AU} = 1.495978707 \times 10^{11}\,\rm{m}\)), and the distance from our solar system to the exoplanet system in parsecs (\(\rm{pc} = 3.0857 \times 10^{16}\,\rm{m}\)). These response variables are key physical parameters that determine the habitability of a planet and the potential feasibility of reaching such a planet.
To ensure we had a consistent dataset with no empty values for any responses or predictors for which we were considering, we filtered out rows with any empty values and were left with 594 rows.
planets <- read.csv("planets.csv")
planets_good <- planets[complete.cases(planets), ]
The structure of the R object for which we will perform our analysis is as follows.
str(planets_good)
## 'data.frame': 594 obs. of 15 variables:
## $ pl_discmethod: Factor w/ 10 levels "Astrometry","EclipseTimingVariations",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ pl_disc : int 2007 2009 2008 2002 1996 2018 2010 2010 2009 2008 ...
## $ pl_locale : Factor w/ 3 levels "Ground","MultipleLocales",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ st_dist : num 110.6 119.5 76.4 18.1 21.4 ...
## $ st_optmag : num 4.74 5.02 5.23 6.61 6.25 ...
## $ st_teff : num 4742 4213 4813 5338 5750 ...
## $ st_mass : num 2.7 2.78 2.2 0.9 1.08 0.99 1.54 1.54 1.93 0.98 ...
## $ st_rad : num 19 29.79 11 0.93 1.13 ...
## $ st_logg : num 2.31 1.93 2.63 4.45 4.36 2.42 3.5 3.5 4.43 1.71 ...
## $ st_metfe : num -0.35 -0.02 -0.24 0.41 0.06 -0.77 -0.03 -0.03 0.19 -0.46 ...
## $ st_vsini : num 1.2 1.5 2.6 1.6 2.18 3.36 2.77 2.77 38.3 1 ...
## $ pl_orbper : num 326 516 186 1773 798 ...
## $ pl_orbsmax : num 1.29 1.53 0.83 2.93 1.66 ...
## $ pl_orbeccen : num 0.231 0.08 0 0.37 0.68 0.042 0.09 0.29 0.29 0.38 ...
## $ pl_bmasse : num 6166 4685 1526 1481 566 ...
There are two factor variables and 13 numeric variables. The following table briefly describes the variables.
| Variable name | Variable description |
|---|---|
pl_discmethod |
Method of discovery (RadialVelocity, Transit, TransitTimingVariations, etc.) |
pl_disc |
Year of discovery |
pl_locale |
Locale of discovery (Ground, MultipleLocales or Space) |
st_dist |
Stellar distance (parsecs) |
st_optmag |
Stellar apparent magnitude (mag) |
st_teff |
Stellar effective temperature (Kelvin) |
st_mass |
Stellar mass (solar masses) |
st_rad |
Stellar radius (solar radii) |
st_logg |
Stellar surface gravity (log g) |
st_metfe |
Stellar metallicity (log(Fe/H or M/H)) |
st_vsini |
Stellar projected rotational velocity (m/s) |
pl_orbper |
Planet orbital period (days) |
pl_orbsmax |
Planet orbital semi-major axis (AU) |
pl_orbeccen |
Planet orbital eccentricity |
pl_bmasse |
Planet mass (Earth masses) |
Some of these variables may be collinear to each other. We can inspect for collinearity visually using a matrix of scatterplots of the variables.
From the matrix of scatterplots, it appears that planet orbital period (pl_orbper) and orbital semi-major axis (pl_orbsmax) are collinear. This collinearity is not unexpected since orbital period and semi-major axis are related to each other physically as supported in Kepler’s third law of planetary motion. It also appears that stellar radius (st_rad) and stellar surface gravity (st_logg) are collinear as well. Collinearity makes estimating model coefficients and interpreting models more difficult, but it does not affect model predictions. We will keep this in mind as we search for potential models for our data.
The response for which we are most interested for the study exoplanet habitability is planet orbital semi-major axis (pl_orbsmax). The planet orbital semi-major axis is essentially the distance between a planet and its host star. This distance is important to planet habitability because only a certain range of distances between a planet and its host star would allow liquid water to exist on the planet to support life as we know it.
We will first begin with an additive model with pl_orbsmax as the response and the remaining variables as predictors except pl_orbper which is fairly collinear with the response.
pl_orbsmax_mod_add <- lm(pl_orbsmax ~ . - pl_orbper, data = planets_good)
summary(pl_orbsmax_mod_add)
##
## Call:
## lm(formula = pl_orbsmax ~ . - pl_orbper, data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7038 -0.7214 -0.1439 0.3867 9.7798
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -9.388e+01 2.440e+01 -3.848
## pl_discmethodTransit -1.097e+00 2.372e-01 -4.626
## pl_discmethodTransitTimingVariations -6.666e-01 9.152e-01 -0.728
## pl_disc 4.715e-02 1.206e-02 3.911
## pl_localeSpace 2.677e-01 1.918e-01 1.395
## st_dist 1.669e-05 2.240e-04 0.075
## st_optmag -1.007e-01 4.485e-02 -2.245
## st_teff 2.832e-04 1.268e-04 2.234
## st_mass 6.088e-02 1.062e-01 0.573
## st_rad -3.025e-02 1.481e-02 -2.042
## st_logg -1.052e-01 1.528e-01 -0.688
## st_metfe 2.051e-01 2.531e-01 0.810
## st_vsini -2.244e-02 7.876e-03 -2.850
## pl_orbeccen 1.095e+00 2.808e-01 3.899
## pl_bmasse 2.497e-04 5.094e-05 4.902
## Pr(>|t|)
## (Intercept) 0.000133 ***
## pl_discmethodTransit 4.61e-06 ***
## pl_discmethodTransitTimingVariations 0.466652
## pl_disc 0.000103 ***
## pl_localeSpace 0.163424
## st_dist 0.940616
## st_optmag 0.025132 *
## st_teff 0.025871 *
## st_mass 0.566649
## st_rad 0.041586 *
## st_logg 0.491539
## st_metfe 0.418113
## st_vsini 0.004532 **
## pl_orbeccen 0.000108 ***
## pl_bmasse 1.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.232 on 579 degrees of freedom
## Multiple R-squared: 0.3196, Adjusted R-squared: 0.3031
## F-statistic: 19.42 on 14 and 579 DF, p-value: < 2.2e-16
From the high p-values from the t-statistic for some of the predictors, a simpler model may be preferred.
We will use a backward BIC search to find a potentially better model.
n <- length(resid(pl_orbsmax_mod_add))
pl_orbsmax_mod_back_bic <- step(pl_orbsmax_mod_add, direction = "backward", k = log(n), trace = 0)
summary(pl_orbsmax_mod_back_bic)
##
## Call:
## lm(formula = pl_orbsmax ~ pl_discmethod + pl_disc + st_teff +
## st_vsini + pl_orbeccen + pl_bmasse, data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3169 -0.7790 -0.1515 0.3546 9.9785
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.418e+01 2.303e+01 -3.654
## pl_discmethodTransit -1.511e+00 1.333e-01 -11.331
## pl_discmethodTransitTimingVariations -9.084e-01 8.826e-01 -1.029
## pl_disc 4.142e-02 1.138e-02 3.638
## st_teff 4.024e-04 1.024e-04 3.928
## st_vsini -2.053e-02 7.686e-03 -2.672
## pl_orbeccen 1.132e+00 2.760e-01 4.100
## pl_bmasse 1.986e-04 4.124e-05 4.816
## Pr(>|t|)
## (Intercept) 0.000281 ***
## pl_discmethodTransit < 2e-16 ***
## pl_discmethodTransitTimingVariations 0.303771
## pl_disc 0.000299 ***
## st_teff 9.58e-05 ***
## st_vsini 0.007757 **
## pl_orbeccen 4.71e-05 ***
## pl_bmasse 1.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.239 on 586 degrees of freedom
## Multiple R-squared: 0.304, Adjusted R-squared: 0.2957
## F-statistic: 36.56 on 7 and 586 DF, p-value: < 2.2e-16
The model found by a backward BIC search is considerably simpler; this model has seven parameters compared to 14 for the additive model. Furthermore, most of the predictors have low p-values associated with their t-statistic. However, the adjusted \(R^2\) of 0.2956573 could be improved.
Next, we will try transform the response pl_orbsmax. Physically, the span of exoplanet orbital semi-major axis is fairly large. Thus, we are motivated to transform the response with a logarithm function and repeat the previous model search.
log_pl_orbsmax_mod_add <- lm(log(pl_orbsmax) ~ . - pl_orbper, data = planets_good)
n <- length(resid(log_pl_orbsmax_mod_add))
log_pl_orbsmax_mod_back_bic <- step(log_pl_orbsmax_mod_add, direction = "backward", k = log(n), trace = 0)
summary(log_pl_orbsmax_mod_back_bic)
##
## Call:
## lm(formula = log(pl_orbsmax) ~ pl_discmethod + pl_locale + st_teff +
## st_rad + st_logg + st_vsini + pl_orbeccen + pl_bmasse, data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.93783 -0.67003 0.05108 0.71753 2.83172
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 4.822e-01 6.533e-01 0.738
## pl_discmethodTransit -2.454e+00 1.308e-01 -18.752
## pl_discmethodTransitTimingVariations -1.041e+00 8.001e-01 -1.301
## pl_localeSpace 8.006e-01 1.596e-01 5.017
## st_teff 3.240e-04 9.966e-05 3.251
## st_rad -3.057e-02 1.179e-02 -2.592
## st_logg -7.140e-01 1.163e-01 -6.137
## st_vsini -1.764e-02 6.922e-03 -2.548
## pl_orbeccen 1.730e+00 2.480e-01 6.974
## pl_bmasse 1.702e-04 4.372e-05 3.893
## Pr(>|t|)
## (Intercept) 0.460751
## pl_discmethodTransit < 2e-16 ***
## pl_discmethodTransitTimingVariations 0.193712
## pl_localeSpace 6.98e-07 ***
## st_teff 0.001215 **
## st_rad 0.009782 **
## st_logg 1.55e-09 ***
## st_vsini 0.011076 *
## pl_orbeccen 8.34e-12 ***
## pl_bmasse 0.000111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 584 degrees of freedom
## Multiple R-squared: 0.5955, Adjusted R-squared: 0.5892
## F-statistic: 95.52 on 9 and 584 DF, p-value: < 2.2e-16
For this log(pl_orbsmax) model, the adjusted \(R^2\) of 0.5892347 is an improvement from the analogous model for pl_orbsmax. We could consider interaction models, but with nine parameters in this latest model already, we will forego deriving an interaction model for now.
We will check the constant variance assumption for our latest model using a residual versus fitted values plot, and using a Breusch-Pagan test.
By visual inspection, the general span of residuals is not very constant across the range of fitted values. This suggests that homoscedasticity or constant variance is suspect.
Next, we will perform the Breusch-Pagan test on the model.
library(lmtest)
log_pl_orbsmax_mod_back_bic_bp <- bptest(log_pl_orbsmax_mod_back_bic)
The p-value from the Breusch-Pagan test for the model is \(1.6910787\times 10^{-25}\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(1.6910787\times 10^{-25} \ll \alpha\). This further suggests that homoscedasticity is suspect.
We will check the normality assumption for our model using a Q-Q plot, and using a Shapiro-Wilk test.
By visual inspection, the data seems to deviate slightly from normality in the lower part of the Q-Q plot. The normality assumption may be suspect.
Next, we will perform the Shapiro-Wilk test on the model.
log_pl_orbsmax_mod_back_bic_sw <- shapiro.test(resid(log_pl_orbsmax_mod_back_bic))
The p-value from the Shapiro-Wilk test for the model is \(0.0023964\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.0023964 \lt \alpha\). This further suggests that normality is suspect.
While we could continue to tweak the model with transformations on the predictors or vary model search approaches like direction (e.g., forward or stepwise) or quality estimator (e.g., AIC) to obtain homoscedasticity and normality, we can also try an exhaustive model search. An exhaustive search that is not too large could allow us to systematically try mutually exclusive combinations of linear and logarithm transformations of the predictors.
We will first setup the storage for the results of the exhaustive model search.
n <- (2^8) * (3^5)
loocv_rmse_results <- rep(0, n)
adj_r2_results <- rep(0, n)
bp_value_results <- rep(0, n)
sw_value_results <- rep(0, n)
num_params_results <- rep(0, n)
We will generate strings that will be parsed and evaluated by R to fit the models. See the appendix for the R code.
We will fit all of the model combinations and store the results.
for (i in 2:n) {
eval(parse(text = paste(lm_str_head, lm_str[i], lm_str_tail)))
adj_r2_results[i] <- get_adj_r2(cur_mod)
bp_value_results[i] <- bptest(cur_mod)$p.value
loocv_rmse_results[i] <- get_loocv_rmse(cur_mod)
num_params_results[i] <- get_num_params(cur_mod)
sw_value_results[i] <- shapiro.test(resid(cur_mod))$p.value
}
We would like to identify models in our collection where we fail to reject the hypothesis of homoscedasticity and normality, and have a reasonably large adjusted \(R^2\) values, e.g., \(\gt 0.5\).
idx <- (bp_value_results > 0.01) & (sw_value_results > 0.01) & (adj_r2_results > 0.5)
There are 98 models where we fail to reject the hypothesis of homoscedasticity and normality for a significance level of \(\alpha = 0.01\).
Of those models, we could choose the model with the lowest LOOCV-RMSE.
eval(parse(text = paste("log_pl_orbsmax_mod_exh = lm(log(pl_orbsmax)~", lm_str[which(idx)[which.min(loocv_rmse_results[which(idx)])]], lm_str_tail)))
We will use a backward BIC search to see if we could find a simpler model with this model as a starting point.
n <- length(resid(log_pl_orbsmax_mod_exh))
log_pl_orbsmax_mod_exh_back_bic <- step(log_pl_orbsmax_mod_exh, direction = "backward", k = log(n), trace = 0)
We will use this model to gain some insight into the orbital semi-major axes of confirmed exoplanets to date.
The R summary for the model we have chosen to examine the orbital semi-major axis of exoplanets is given below.
summary(log_pl_orbsmax_mod_exh_back_bic)
##
## Call:
## lm(formula = log(pl_orbsmax) ~ pl_locale + st_optmag + st_vsini +
## log(pl_bmasse), data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8317 -0.8263 -0.0032 0.8318 3.1664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.327302 0.256135 1.278 0.202
## pl_localeSpace 1.118187 0.172783 6.472 2.04e-10 ***
## st_optmag -0.380171 0.020367 -18.666 < 2e-16 ***
## st_vsini -0.040279 0.006106 -6.597 9.35e-11 ***
## log(pl_bmasse) 0.388026 0.029190 13.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.178 on 589 degrees of freedom
## Multiple R-squared: 0.5335, Adjusted R-squared: 0.5304
## F-statistic: 168.4 on 4 and 589 DF, p-value: < 2.2e-16
We will check the constant variance assumption for our final model using a residual versus fitted values plot, and using a Breusch-Pagan test.
By visual inspection, the general span of residuals is mostly constant across the range of fitted values although many of the datapoints are clustered in two areas.
Next, we will perform the Breusch-Pagan test on the model.
library(lmtest)
log_pl_orbsmax_mod_exh_back_bic_bp <- bptest(log_pl_orbsmax_mod_exh_back_bic)
The p-value from the Breusch-Pagan test for the model is \(0.3015627\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(1.6910787\times 10^{-25} \gt \alpha\). This suggests that homoscedasticity is not suspect.
We will check the normality assumption for our model using a Q-Q plot, and using a Shapiro-Wilk test.
By visual inspection, the data seems to consistent with normality.
Next, we will perform the Shapiro-Wilk test on the model.
log_pl_orbsmax_mod_exh_back_bic_sw <- shapiro.test(resid(log_pl_orbsmax_mod_exh_back_bic))
The p-value from the Shapiro-Wilk test for the model is \(0.3724764\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.3724764 \gt \alpha\). This suggests that normality is not suspect.
The model we have chosen to predict the (logarithm of) planet orbital semi-major axis is an additive model with four predictors which are the locale of discovery, the stellar apparent magnitude, the stellar projected rotational velocity, and the logarithm of planet mass.
get_bp_decision = function(model, alpha) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_num_params = function(model) {
length(coef(model))
}
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
lm_str_head <- "cur_mod=lm(log(pl_orbsmax)~"
lm_str_tail <- ",data=planets_good)"
lm_str <- rep("", n)
i <- 1
for (ipl_discmethod in c("1", "pl_discmethod")) {
for (ipl_disc in c("1", "pl_disc")) {
for (ipl_locale in c("1", "pl_locale")) {
for (ist_dist in c("1", "st_dist", "log(st_dist)")) {
for (ist_optmag in c("1", "st_optmag")) {
for (ist_teff in c("1", "st_teff", "log(st_teff)")) {
for (ist_mass in c("1", "st_mass", "log(st_mass)")) {
for (ist_rad in c("1", "st_rad", "log(st_rad)")) {
for (ist_logg in c("1", "st_logg")) {
for (ist_metfe in c("1", "st_metfe")) {
for (ist_vsini in c("1", "st_vsini")) {
for (ipl_orbeccen in c("1", "pl_orbeccen")) {
for (ipl_bmasse in c("1", "pl_bmasse", "log(pl_bmasse)")) {
lm_str[i] <- paste(ipl_discmethod,
ipl_disc,
ipl_locale,
ist_dist,
ist_optmag,
ist_teff,
ist_mass,
ist_rad,
ist_logg,
ist_metfe,
ist_vsini,
ipl_orbeccen,
ipl_bmasse,
sep = "+")
i <- i + 1
}
}
}
}
}
}
}
}
}
}
}
}
}